This report presents the results of a set of simulation studies used to validate methods applied in the Variation-Based Distance & Similarity Modeling (VADIS) method introduced by Szmrecsanyi, Grafmiller & Rosseel (2019). The VADIS method builds upon techniques in comparative sociolinguistics and dialectometry for quantifying the similarity between varieties and dialects as captured by correspondences among the ways in which language users choose between different ways of saying the same thing. For details of the method and theoretical motivation see Szmrecsanyi, Grafmiller & Rosseel (2019). The {VADIS} R package can be found at https://github.com/jasongraf1/VADIS.

1 VADIS in a nutshell

The VADIS method is designed to measure the degree of (dis)similarity among “variable grammars” (Tagliamonte 2012) of different dialects or varieties, where a variable grammar is understood as the set of constraints (a.k.a. predictors or “conditioning factors”) governing the choice between two or more linguistic variants. Variants can be individual lexical items (sneakers vs. trainers vs. tennis shoes), grammatical constructions (give me the book vs. give the book to me), or phonetic realizations of a particular phoneme (e.g. [ʊ] vs. [ʌ] pronunciations of the STRUT vowel).

The method takes inspiration from Comparative Sociolinguistics, which evaluates the relatedness between varieties and dialects based on how similar the conditioning of variation is in these varieties (Poplack & Tagliamonte 2001; Tagliamonte 2013). Comparative sociolinguists rely on three lines of evidence to determine relatedness:

  1. Are the same constraints significant across varieties?
  2. Do the constraints have the same strength across varieties?
  3. Is the relative explanatory importance of the constraints similar?

Similarity assessed via this approach is often interpreted as historical and genetic relatedness in variationist studies (e.g. Poplack & Tagliamonte 2001; Claes 2016; Childs 2017), though the approach has potential applications for investigating language contact phenomena (e.g. Ortin & Fernandez-Florez 2019) as well as individual level variation (e.g. Blaxter et al. 2019). VADIS draws inspiration from this literature and adapts the comparative sociolinguistics method so that it can be applied to datasets sampling (a) more than a handful of dialects or varieties, and (b) more than one variation phenomenon at a time. This is accomplished through more rigorous quantification of the variability among the representative variable grammars.

1.1 The analysis pipeline

In practice, applying the method to a given alternation consists of the following steps:

VADIS Step 1. Identify the set of constraints or predictors P that govern the choice among variants, based on prior research and/or theoretical knowledge of the variable in question. Create datasets of carefully selected observations of said variable(s) across different varieties following established principles and practices in variationist research. Annotate these datasets for the relevant features corresponding to the set of predictors.

VADIS Step 2. Fit a regression model (specifically, a mixed-effects logistic regression model) to each variety-specific dataset, where the model structure will depend on the set of predictors thought to influence the variable.

VADIS Step 3. Using the variety-specific regression models, determine cross-variety similarity based on predictor significance. In this step, the similarity between two varieties is proportional to the extent to which the varieties overlap in which predictors significantly regulate variant choice. Distance is measured as the squared Euclidean distance, divided by the total number of constraints.

VADIS Step 4. Using the variety-specific regression models, determine cross-variety similarity based on predictor effect size and direction. In this step, the distance between two varieties is measured as the (Euclidean) distance between the two models’ coefficient estimates, so that a smaller distance indicates greater similarity. These distances are then scaled to fall between 0 and 1 (see Szmrecsanyi, Grafmiller & Rosseel 2019:5 for details).

VADIS Step 5. Fit a random forest model to each variety-specific dataset, where the model structure will include (as much as possible) the same predictors as in Step 2.1

VADIS Step 6. Using the variety-specific random forest models, determine cross-variety similarity based on the relative predictive importance of the predictors. In this last step, the similarity between two varieties is proportional to the rank correlation between the two models’ predictor importance rankings.

The method provides two main outputs for assessing similarity among varieties: a set of 3 distance matrices, scaled from 0 to 1, reflecting the pairwise distances between datasets based on measures from each of the 3 lines of evidence; and 3 sets of similarity scores which reflect the average similarity of each variety to all others (calculated as 1 - the average distance). Variations of this method were used by Grafmiller & Szmrecsanyi (2018), and the method in full is introduced and described in Szmrecsanyi, Grafmiller & Rosseel (2019).
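As a rough illustration of the second output, the following base R sketch derives per-variety similarity scores from a single distance matrix. The matrix values and variety labels are invented for illustration, and excluding the zero diagonal from the average is an assumption of this sketch rather than a documented detail of the {VADIS} package.

```r
# Hypothetical 3 x 3 distance matrix, scaled 0-1 (values invented for illustration)
varieties <- c("A", "B", "C")
dist_mat <- matrix(
  c(0.0, 0.8, 0.4,
    0.8, 0.0, 0.6,
    0.4, 0.6, 0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(varieties, varieties)
)

# Average distance of each variety to all OTHER varieties.
# Assumption: the zero self-distance on the diagonal is left out of the mean.
n_var <- nrow(dist_mat)
mean_dist <- rowSums(dist_mat) / (n_var - 1)

# Similarity score = 1 - average distance
similarity <- 1 - mean_dist
similarity
```

In this toy case variety C ends up with the highest similarity score because it sits closest, on average, to the other two.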

We exemplify the method with the two tables below. Table 1.1 shows a table of hypothetical coefficients from regression models of 3 varieties, similar to tables of factor weights in traditional variationist work.

Figure 1.1: Table of predictor weightings (logit scaled) for 5 simulated varieties. From Szmrecsanyi et al. (2019:5).

Table 1.2 shows a matrix of pairwise distances derived from the first table, where greater values reflect greater dissimilarity between two varieties (0 indicates identical varieties). Such distance matrices work essentially like distance grids in road atlases, which specify geographic distances between locations.

Figure 1.2: Table of predictor weightings (logit scaled) for 5 simulated varieties. From Szmrecsanyi et al. (2019:5).

Distance matrices of the kind above are the basic inputs of classical dialectometry, and are extremely useful for visualizing similarities in a number of ways. For example, Szmrecsanyi et al. (2019) use the VADIS method to derive a distance matrix from the various predictor importance rankings for the particle placement alternation in nine English varieties (Figure 1.3).

Figure 1.3: Figure 1 from Szmrecsanyi et al. (2019): VADIS distance matrix for the 3rd line of evidence in the particle placement alternation in 9 varieties of English. Scores range between 0 (maximal similarity) and 1 (maximal dissimilarity).

From this matrix they create a multidimensional scaling plot to map the relative similarity in these rankings across the nine varieties, where greater proximity on the MDS plot represents greater similarity with respect to the effects of constraints on particle placement (Figure 1.4).

Figure 1.4: MDS representation of 3rd line distances for the particle placement alternation. Distances between data points in plot are proportional to probabilistic grammar distances between varieties. From Szmrecsanyi et al. (2019).

The key idea is that, via such techniques, differences in the variable grammars across varieties can be examined more easily, and more holistically, than with traditional sociolinguistic methods. In standard Comparative Sociolinguistics studies the results of separate models are compared mainly by visual inspection of tables of model outputs (see e.g. Tagliamonte 2013:135, 145), and patterns of covariation across varieties can be difficult to discern. VADIS builds upon this approach by quantifying such patterns in ways that can then be used to model similarities among varieties more robustly. As in traditional comparative approaches, this method is holistic in that it focuses less on differences among specific predictors, and more on how the systems as a whole vary across varieties. The method certainly does not preclude closer investigation of individual predictors, and indeed we recommend such detailed analysis. Rather the VADIS method is intended to provide an alternative, “bird’s eye view” of the patterns across different varieties’ variable grammars.

1.2 Validating the method

We believe VADIS is a promising approach for scaling up and contemporizing comparative variationist methods, yet a few questions remain regarding its validity and reliability. For one, it is difficult to evaluate the validity of the method because we rarely have ways of independently measuring grammatical similarities (of the kind the method measures) among varieties. Correlations among other, usually external, factors are at best just correlations, and at worst invitations to confirmation bias. We would like to know that the method can adequately capture differences among variable grammars. A related problem is that we lack a baseline of meaningful (dis)similarity against which we can evaluate our results in any given case. In the spirit of other guides on meaningful effect sizes (e.g. Cohen 1992), we would like to have some idea what a range of small, medium, or large similarity scores is likely to be. Finally, it is vital to bear in mind that the statistics obtained from the underlying models, e.g. coefficient estimates in regression models, always come with some degree of error, and this should in principle be taken into account somehow. Ideally we would like to have a way of incorporating the variance in the underlying models into the VADIS output.

With these concerns in mind, the following report is guided by three research questions:

RQ 1. How valid are the measures and representations of grammatical distances? In other words, do the measures of similarity obtained via the VADIS method accurately reflect the true degree of similarity among grammars?

RQ 2. Can we determine a reasonable range of small, medium, or large degrees of grammatical similarity?

RQ 3. Can we capture the uncertainty in our underlying grammatical models in the VADIS output?

To answer these questions we simulate realistic datasets of the kind common in comparative sociolinguistic studies, and apply the VADIS method to them. These simulated datasets are specifically designed to vary in the degree of similarity among them, and thus allow us to test the validity and reliability of the method in a hypothetical scenario. We present the procedures for creating the simulated datasets in the next section, followed by a brief description of the modeling procedures. We then present the results of the VADIS analysis in the remaining sections. [Then a real world application to be added…]

2 Simulation overview

Our aim is to create simulated datasets in which we explicitly control how similar the variable grammars are to one another, apply the VADIS modeling to them, and then examine how these differences are captured in the VADIS output. The basic idea is that more similar grammars should have higher similarity scores and cluster closer together in multidimensional scaling (MDS) and/or clustering representations, but we can also make more specific predictions based on how we construct the simulated grammars.

The simulation analysis proceeds as follows:

Simulation Step 1: We create variable grammars for a number of “varieties” based on a set of dummy predictors with different constraint weightings that are predefined to be more or less similar to one another in specific ways.

Simulation Step 2: We create a set of simulated datasets for each “variety” representing different individual “speakers” within that variety. We generate a (binary) set of outcome variants for each dataset using the variable grammar for that respective variety. Each speaker in a variety is assumed to have a different baseline use of the outcome variants, which is captured in adjustments to the intercepts in the models used to generate the datasets.

Simulation Step 3: We fit separate regression models and random forest models to each dataset, and extract the respective statistical metrics for the VADIS method, i.e. the 3 ‘lines of evidence’. These are the fixed effects coefficients and their significance measures for the regression models (lines 2 and 1 respectively), and the permutation variable importances from the random forest models (line 3).

Simulation Step 4: We calculate the pairwise similarity scores for our datasets, from which we calculate the average score for each line of evidence, as well as a combined score. We examine whether these similarity scores accurately reflect the degree of similarity in the simulated grammars.

Simulation Step 5: We use the distance matrices derived from the models to explore similarities among the datasets/varieties with dimension reduction techniques (MDS) and clustering methods. We then examine whether the predicted patterns are apparent in the visualizations.

Note that all simulations model binary outcomes.

2.1 Simulating probabilistic grammars

For the probabilistic variable grammars, we create a set of 8 test predictors which will correlate with a hypothetical binary outcome. All these predictors are designed to be representative of those found in natural language data. The predictors were designed as follows (more details below).

  • 3 binary factors
  • 1 categorical factor with 3 levels
  • 4 continuous predictors
    • 2 based on a normal distribution
    • 1 based on an F distribution (as approximated in e.g., normalized frequency distributions)
    • 1 based on a Poisson distribution (as in e.g., the number of words in a constituent)

Weightings for the predictors in all varieties are treated as adjustments on the logit scale, thus are comparable to coefficients in a logistic regression model. We use these predictors and weightings to create 5 distinct “variety grammars”:

Variety A. This is a baseline grammar with a reasonable range of predictor effect sizes (weightings).

Variety B. A mirror image of Variety A. Predictors have exactly the same effect sizes, but in the opposite direction. That is, weightings have identical absolute values but opposite signs.

Variety C. A third variety with predictor weightings that are very different from both Varieties A and B, yet are still within a reasonable range.

Variety D. A variety with predictor weightings equivalent to those of Variety C, yet randomly increased or decreased by 20%. For example, if predictor p has an effect size of 1 in Variety C, it would be randomly assigned a value of either 1.2 or 0.8 in Variety D. For predictor q = .4 in Variety C, q in Variety D would be either 0.48 or 0.32. And so on.

Variety E. A ‘Frankenstein’ variety composed of 2 weightings taken from each of the other 4 varieties (2 from A, 2 from B, 2 from C, and 2 from D).

The weightings are given in Table 2.1 below. These are the simulated probabilistic grammars governing use of a hypothetical (binary) variable.

Table 2.1: Table of predictor weightings (logit scaled) for 5 simulated varieties

Predictor     Var.A   Var.B   Var.C    Var.D   Var.E
Bin1 = Y       2.00   -2.00    0.40    0.320    2.00
Bin2 = Y      -1.38    1.38   -1.00   -1.200   -1.00
Bin3 = Y       0.56   -0.56    1.00    0.800    1.00
Cont1         -0.05    0.05    0.03    0.024    0.05
Cont2          0.30   -0.30   -0.30   -0.360   -0.30
Cont3          0.69   -0.69    0.90    0.720    0.90
Cont4          1.80   -1.80    0.40    0.480   -0.69
Cat = ‘b’     -1.40    1.40    2.40    2.880   -1.40
Cat = ‘c’     -0.69    0.69   -0.90   -1.080   -0.08
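The construction of Varieties B and D from A and C can be sketched in a few lines of base R. The object names and the seed are hypothetical, and because the direction of each 20% adjustment is drawn at random here, the resulting Variety D column will not necessarily match the specific values in Table 2.1.

```r
set.seed(1)  # arbitrary seed, used only to make this sketch reproducible

predictors <- c("bin1", "bin2", "bin3", "cont1", "cont2",
                "cont3", "cont4", "cat_b", "cat_c")

# Weightings for Varieties A and C, copied from Table 2.1
var_A <- c(2.00, -1.38, 0.56, -0.05, 0.30, 0.69, 1.80, -1.40, -0.69)
var_C <- c(0.40, -1.00, 1.00, 0.03, -0.30, 0.90, 0.40, 2.40, -0.90)
names(var_A) <- predictors
names(var_C) <- predictors

# Variety B: identical magnitudes, opposite signs
var_B <- -var_A

# Variety D: each weighting of C randomly scaled up or down by 20%
direction <- sample(c(-1, 1), length(var_C), replace = TRUE)
var_D <- var_C * (1 + 0.2 * direction)

grammars <- cbind(A = var_A, B = var_B, C = var_C, D = var_D)
round(grammars, 3)
```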

In addition to these test predictors, we include 3 ‘noise’ predictors which are not associated with the outcome.

  • 3 noise predictors
    • 1 based on a continuous normal distribution
    • 1 based on a Poisson distribution
    • 1 binary factor

Again, these predictors are designed to be representative of features found in natural language data.

2.2 Simulating variety datasets

For the simulation we created datasets of 2000 observations each. For each observation we randomly generate possible values for 11 features corresponding to the 8 test predictors and 3 noise predictors. The hypothetical features are distributed as follows:

  • 3 binary features
    • bin1: binary factor, P(1) = 0.35
    • bin2: binary factor, P(1) = 0.75
    • bin3: binary factor, P(1) = 0.6
  • 1 categorical feature with 3 discrete values
    • cat1: ‘a’, ‘b’, ‘c’; P(a) = .6; P(b) = .3; P(c) = .1
  • 4 continuous features
    • cont1: normal distribution, μ = 100, σ = 5
    • cont2: normal distribution, μ = 10, σ = 2
    • cont3: F distribution, F(d1 = 1, d2 = 10)
    • cont4: Poisson distribution, Poisson(λ = 2)
  • 3 noise features (no relation to the outcome)
    • noise1: normal distribution, μ = 100, σ = 5
    • noise2: Poisson distribution, Poisson(λ = 2)
    • noise3: binary factor, P(1) = 0.4

The values of the 8 test features, together with the weightings in Table 2.1, are used to calculate the probability that the outcome variant in a given observation is 1 (vs. 0). So, for a given observation i,

\[P\left({\rm outcome}_i = 1\right) = \frac{1}{1+\exp(-X\beta)},\]

where

\[\begin{eqnarray*} X\beta = & & {\rm Intercept} \\ & & + \beta_1\:[{\rm bin1}] + \beta_2\:[{\rm bin2}] + \beta_3\:[{\rm bin3}] \\ & & + \beta_4\:[{\rm cont1}] + \beta_5\:[{\rm cont2}] + \beta_6\:[{\rm cont3}] + \beta_7\:[{\rm cont4}] \\ & & + \beta_8\:[{\rm cat1 = b}] + \beta_9\: [{\rm cat1 = c}]\\ \end{eqnarray*}\]

For the purpose of the calculation, bin1, bin2, and bin3 each have values of {0, 1}, and cat1[a] = 0, cat1[b] = 1, cat1[c] = 2.

For each hypothetical token, we then randomly assign an outcome of 0 or 1 from a binomial distribution using the probability of success calculated for that token from the formula above.
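The data-generating procedure in this section can be sketched in base R as follows. This is a minimal illustration under several assumptions: the intercept is set to 0, the continuous features enter the linear predictor on their raw scale exactly as the equation is written (any centering or standardization is not shown), the two categorical terms are treated as 0/1 indicators for cat1 = b and cat1 = c, and the seed and object names are hypothetical.

```r
set.seed(42)  # arbitrary seed for this sketch
n_obs <- 2000

# Feature distributions as listed in Section 2.2
dat <- data.frame(
  bin1   = rbinom(n_obs, 1, 0.35),
  bin2   = rbinom(n_obs, 1, 0.75),
  bin3   = rbinom(n_obs, 1, 0.60),
  cat1   = sample(c("a", "b", "c"), n_obs, replace = TRUE,
                  prob = c(0.6, 0.3, 0.1)),
  cont1  = rnorm(n_obs, mean = 100, sd = 5),
  cont2  = rnorm(n_obs, mean = 10, sd = 2),
  cont3  = rf(n_obs, df1 = 1, df2 = 10),
  cont4  = rpois(n_obs, lambda = 2),
  noise1 = rnorm(n_obs, mean = 100, sd = 5),
  noise2 = rpois(n_obs, lambda = 2),
  noise3 = rbinom(n_obs, 1, 0.4)
)

# Variety A weightings from Table 2.1 (noise features get no weight)
beta <- c(bin1 = 2.00, bin2 = -1.38, bin3 = 0.56,
          cont1 = -0.05, cont2 = 0.30, cont3 = 0.69, cont4 = 1.80,
          cat_b = -1.40, cat_c = -0.69)
intercept <- 0  # assumption: no baseline adjustment in this sketch

# Design matrix with indicator columns for the categorical feature
X <- with(dat, cbind(bin1, bin2, bin3, cont1, cont2, cont3, cont4,
                     cat_b = as.numeric(cat1 == "b"),
                     cat_c = as.numeric(cat1 == "c")))

# Inverse logit of the linear predictor, then one Bernoulli draw per token
eta <- intercept + drop(X %*% beta)
p <- plogis(eta)
dat$Variant <- rbinom(n_obs, size = 1, prob = p)
```

Speaker-level variation of the kind described in the next section would enter only through `intercept`.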

2.3 Simulating individual speakers

To take this simulation a step further, we simulate variation in preference at the level of individuals. The idea is that users of a given variety will share the same underlying grammar, but they may vary in their baseline preference for one variant or another. We do this by creating 15 datasets for each variety, where each dataset is generated using a different intercept. The intercepts are randomly sampled from a normal distribution with a mean of 0 and SD of .5, which gives us a range of values that is a reasonable approximation of natural variation where 95% of speakers will have a baseline proportion between .27 and .73. For each speaker we generate 2000 observations.

This results in a final tally of 75 distinct simulated datasets (5 varieties x 15 speakers), comprising 150000 total observations.
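The stated range of baseline proportions can be checked directly: the central 95% of a normal distribution with mean 0 and SD 0.5 on the logit scale is mapped to the probability scale with the inverse logit. The sketch below uses base R only; the seed and object names are hypothetical.

```r
set.seed(7)  # arbitrary seed for this sketch
n_speakers <- 15

# One intercept (logit scale) per simulated speaker
speaker_intercepts <- rnorm(n_speakers, mean = 0, sd = 0.5)

# Central 95% of the intercept distribution, converted to proportions
baseline_range <- plogis(qnorm(c(0.025, 0.975), mean = 0, sd = 0.5))
round(baseline_range, 2)
# 0.27 0.73
```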

2.4 Step-wise variation

Finally, to more fully examine relative degrees of difference, we create a series of “stepped” varieties in which we incrementally increase the variance in predictor effects among ‘individuals’ within each ‘variety’. This results in a series of linguistic ‘communities’ with increasingly greater heterogeneity in individuals’ variable grammars.

We create these datasets by keeping the mean predictor effects constant across varieties, but we gradually increase the standard deviations of predictor effects among individuals within the variety. Specifically, at each step we increase the standard deviation of each predictor by 5% of its mean, and randomly sample values for each individual from a hypothetical normal distribution. For example, take a hypothetical predictor p with a mean effect size of 0.8 on the logit scale. At Step 0, we sample 15 values from a random normal distribution with mean of .8 and SD = 0, and assign those values to the 15 ‘individuals’ in ‘Variety 0’. In the next step, we again sample 15 values from a random normal distribution with mean of .8 but this time with a SD = .05 * .8 = 0.04, and assign those values to the 15 ‘individuals’ in ‘Variety 1’. For the next step we sample values from a random normal distribution with mean of .8 and SD = .1 * .8 = 0.08, and so on. We continue incrementing the multiplicative factor of the standard deviation by .05 up to a final step of .5, where we sample from a random normal distribution with mean of .8 and SD = .4. The results of this simulation are shown in Figure 2.1.
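The stepping procedure for a single predictor can be sketched as follows, using the worked example of a mean effect of 0.8. Object names and the seed are hypothetical, and taking the absolute value of the mean when computing the SD (so that negative effects still receive a valid SD) is an assumption of this sketch.

```r
set.seed(123)  # arbitrary seed for this sketch

mean_effect <- 0.8                      # mean weighting on the logit scale
n_individuals <- 15
sd_factors <- seq(0, 0.5, by = 0.05)    # 11 steps: 0, .05, ..., .5

# One column per stepped variety, one row per individual
stepped <- sapply(sd_factors, function(f) {
  rnorm(n_individuals, mean = mean_effect, sd = f * abs(mean_effect))
})
colnames(stepped) <- paste0("Variety", seq_along(sd_factors) - 1)

# Observed spread among individuals at each step
round(apply(stepped, 2, sd), 3)
```

At Step 0 the SD is 0, so all 15 individuals receive exactly the mean weighting; the spread then grows with each step.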

Figure 2.1: Predictor weightings (logit scaled) for 15 simulated individuals in 11 varieties with gradually increasing internal heterogeneity. Blue dots and lines represent the mean values +/- 1 SD.

3 Modeling

To evaluate the VADIS method we consider two common scenarios in comparative variationist research. The first scenario involves aggregating across individuals to create a model of the community grammar. This is by far the most common scenario in comparative variationist research, if for no other reason than that we often lack sufficient numbers of observations for each individual to reliably model them separately. So in this scenario we are looking at comparisons at the level of the community, or variety, more generally.

The second scenario involves comparing the grammars of individual members of a linguistic community (or users of a variety). The assumption behind this second test case is that members of the same community to a large extent share the same underlying grammar, yet may vary with respect to their baseline preference for different variants. While this is an admittedly simplistic assumption, it is useful in that it provides us with a way of assessing the minimum distance that the method can be expected to detect in a real world situation. Put another way, it enables us to see just how much differences in the baseline proportion of the variants is likely to contribute to our measures of similarity. Comparing “individuals” this way provides an approximate ceiling to our similarity scores by providing an average similarity for data sampled from different individuals with the exact same underlying grammar. This helps address our second research question.

3.1 Similarities across communities

The full datasets for each variety consist of 30000 total observations (2000 tokens x 15 speakers), which is not a realistic dataset for typical variationist studies. Therefore, we down-sampled the data by randomly sampling 150 tokens from each speaker in each variety to give us 5 datasets of 2250 observations (150 tokens x 15 speakers) each. We then account for speaker variability by including a by-speaker intercept in our model structure.

f <- Variant ~ (1|Speaker) + bin1 + bin2 + bin3 + cat1 + cont1 + cont2 + cont3 + cont4 +
  noise1 + noise2 + noise3
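The down-sampling step described above could be sketched as follows. The data frame `full` is a toy stand-in for one variety's full dataset (15 speakers with 2000 tokens each and a random outcome), and all object names and the seed are hypothetical.

```r
set.seed(2024)  # arbitrary seed for this sketch

# Toy stand-in for one variety: 15 speakers x 2000 tokens
full <- data.frame(
  Speaker = rep(paste0("s", 1:15), each = 2000),
  Variant = rbinom(15 * 2000, 1, 0.5)
)

# Randomly keep 150 row indices per speaker
rows_by_speaker <- split(seq_len(nrow(full)), full$Speaker)
keep <- unlist(lapply(rows_by_speaker, function(idx) sample(idx, 150)))
down <- full[keep, ]

table(down$Speaker)  # 150 tokens for each of the 15 speakers
```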

For the first two lines of evidence, we fit Bayesian generalized linear mixed models with standardized inputs and reasonably informative normal priors (normal(0,2)) for the fixed effects (Gelman et al. 2008; Ghosh, Li & Mitra 2018). Models are fit using the {brms} package (Bürkner 2017). For the first line of evidence, which relies on statistical significance tests, we take a frequentist-like approach and simply use the 95% highest posterior density (HPD) interval. If the HPD interval for a given predictor does not include 0, the effect is treated as “significant.”2
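To illustrate the significance criterion, the sketch below computes a 95% HPD interval from a vector of posterior draws and checks whether it excludes 0. The draws are simulated stand-ins rather than output from a fitted {brms} model, and `hpd_interval()` and `excludes_zero()` are hypothetical helpers written for this example: they take the shortest interval containing the requested share of sorted draws, which is one common way to approximate an HPD interval and not necessarily the routine used in the report.

```r
# Shortest interval containing `prob` of the draws (empirical HPD)
hpd_interval <- function(draws, prob = 0.95) {
  sorted <- sort(draws)
  n <- length(sorted)
  gap <- max(1, min(n - 1, round(n * prob)))
  starts <- seq_len(n - gap)
  widths <- sorted[starts + gap] - sorted[starts]
  best <- which.min(widths)
  c(lower = sorted[best], upper = sorted[best + gap])
}

# A predictor counts as "significant" if the interval excludes 0
excludes_zero <- function(draws, prob = 0.95) {
  int <- hpd_interval(draws, prob)
  unname(int["lower"] > 0 || int["upper"] < 0)
}

set.seed(99)  # arbitrary seed for this sketch
draws_effect <- rnorm(4000, mean = 1.5, sd = 0.3)  # simulated clear effect
draws_null   <- rnorm(4000, mean = 0,   sd = 0.3)  # simulated null effect

excludes_zero(draws_effect)
excludes_zero(draws_null)
```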

For the third line of evidence we fit random forest models using the {ranger} package (Wright & Ziegler 2017) and calculate the permutation based measure of predictor importance (Altmann et al. 2010; Nicodemus et al. 2010).3

3.2 Similarities across individuals

For examining similarities among individuals within varieties, simple generalized linear models (glm) without random effects were fit to each of the 75 datasets. The model structure was the same as above, minus the by-speaker intercept:

f <- Variant ~ bin1 + bin2 + bin3 + cat1 + cont1 + cont2 + cont3 + cont4 +
  noise1 + noise2 + noise3

This same formula was used for the regression and random forest models. Separate regression and random forest models were fit to each individual dataset.

4 Capturing grammatical similarity

We now turn to the results of applying the VADIS method to the simulated datasets. But before doing so, it’s worth reflecting on what we expect to see. In a nutshell, we expect to find the following basic patterns:

  • Varieties A and B should be quite distinct from one another, as their grammars are polar opposites.
  • Varieties C and D should be quite similar to each other, due to the relatively small differences in their constraints, yet both should be quite distinct from A and B.
  • Variety E should be different still from the others, as it was created from a combination of predictors from each variety.
  • Since speakers of the same variety differ only in their baseline frequencies, inter-speaker variation within each variety should be considerably less than variation across varieties, with the possible exception of varieties C and D. In other words, we expect speakers of the same variety to cluster tightly with one another.

To validate the VADIS method, we expect to see these patterns emerge in the distance/similarity measures and visualizations derived from those measures. Successful validation would result in clear sorting and/or clustering of our ‘individuals’ within varieties, as well as clear separation of (some of) those clusters across the varieties according to the patterns described above.

4.1 Distance matrices

We start with a consideration of the pairwise distances between each of the 75 distinct ‘individuals’, and consider the distances as assessed via the three ‘lines of evidence’: statistical significance (Line 1), effect size and direction (Line 2), and predictor importance ranking (Line 3). We use a heatmap plot to represent the pairwise distance matrices for each line of evidence.

4.1.1 Line 1: Statistical significance

The first line asks: to what extent are the same predictors statistically significant across the two varieties? The distance matrix for this line (Figure 4.1) is derived from a table of binary values using the squared Euclidean distance. The resulting values are then divided by the total number of predictors to arrive at values between 0 and 1.
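A minimal base R sketch of this calculation is given below. The significance table is invented for illustration (three hypothetical datasets, four hypothetical predictors), with 1 marking a significant predictor and 0 a non-significant one.

```r
# Hypothetical significance table: rows = datasets, columns = predictors
sig <- rbind(
  V1 = c(1, 1, 0, 1),
  V2 = c(1, 0, 0, 1),
  V3 = c(0, 0, 1, 1)
)
colnames(sig) <- paste0("p", 1:4)

# Squared Euclidean distance, divided by the number of predictors
d_line1 <- dist(sig, method = "euclidean")^2 / ncol(sig)
m_line1 <- as.matrix(d_line1)
m_line1
```

For binary inputs the result is simply the proportion of predictors on which two datasets disagree: V1 and V2 differ on one of four predictors, so their distance is 0.25. This also makes it easy to see why a single change in significance can shift the distance noticeably when the number of predictors is small.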

Figure 4.1: VADIS distance matrix for the 1st line of evidence in comparison of 75 hypothetical individuals across 5 hypothetical varieties.

The picture here suggests that this line of evidence alone is not very useful for distinguishing these hypothetical varieties. This illustrates one of the main limitations of this first line, which is that the distances are derived from a relatively small set of 11 binary values. Overall there is little information by which we can discriminate among the 75 datasets, and even a single change from one dataset to the next can have a substantial impact on the distance measure. With larger models containing perhaps dozens of predictors, all of which could potentially be significant, this line might be of more use (as tables of binary features are used in taxonomic studies). But such models are not realistic in variationist studies, so we’ll need to triangulate these results with other lines of evidence.

4.1.2 Line 2: Effect size

The second line asks: to what extent do the predictors’ effects have the same size and direction across the two varieties?4 The distance matrix for this line (Figure 4.2) is derived from a table of regression coefficients (on the logit scale) using the Euclidean distance. The resulting distances can be further normalized to, e.g., fall between 0 and 1. Normalization in this way provides a baseline against which similarities can be compared, and allows (in principle) the results of one variable to be compared to other variables to assess cross-varietal similarity of different linguistic variables. However, we present the raw (unnormalized) distances here, in order to investigate similarities among varieties with minimal distortion.
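The raw distance calculation can be sketched in base R as follows. Note the assumption: instead of coefficients estimated from fitted models, this sketch plugs in the true weightings of Varieties A to D from Table 2.1, so it shows the distances the design implies rather than the distances actually reported.

```r
# True weightings from Table 2.1, used here as stand-ins for model estimates
weights <- rbind(
  A = c(2.00, -1.38, 0.56, -0.05, 0.30, 0.69, 1.80, -1.40, -0.69),
  B = c(-2.00, 1.38, -0.56, 0.05, -0.30, -0.69, -1.80, 1.40, 0.69),
  C = c(0.40, -1.00, 1.00, 0.03, -0.30, 0.90, 0.40, 2.40, -0.90),
  D = c(0.320, -1.200, 0.800, 0.024, -0.360, 0.720, 0.480, 2.880, -1.080)
)

# Raw (unnormalized) Euclidean distances between coefficient vectors
d_line2 <- dist(weights, method = "euclidean")
m_line2 <- as.matrix(d_line2)
round(m_line2, 2)
```

Under these inputs the A-B distance is the largest in the matrix (twice the length of A's weight vector, since B is its mirror image) and the C-D distance is the smallest, which is the pattern the design is meant to produce.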

Figure 4.2: VADIS distance matrix for the 2nd line of evidence in comparison of 75 hypothetical individuals across 5 hypothetical varieties.

The picture here looks exactly like we want. The A and B varieties are least like one another, while the C and D varieties are most like one another. Variety E is different from all the others, though perhaps slightly more similar to C and D than A and B. Moreover, the intra-variety variability is very small compared to the differences across varieties as a whole. This is exactly as designed in the simulated data, though we note that such a high degree of intravarietal homogeneity is unlikely to show up in the real world. There is surely greater variability among individual language users than this simulation implies.

4.1.3 Line 3: Predictor importance

The third line asks: to what extent do the predictors have the same relative importance in the grammar across the two varieties? The distance matrix for this line (Figure 4.3) is derived from a table of random forest variable importance rankings, where distance is measured as 1 - the Spearman rank correlation ρ.
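The distance calculation for this line can be sketched in base R. The importance scores below are invented for illustration (three hypothetical datasets, four hypothetical predictors) and do not come from fitted random forests.

```r
# Hypothetical variable importance scores: rows = datasets, columns = predictors
imp <- rbind(
  V1 = c(p1 = 0.30, p2 = 0.20, p3 = 0.10, p4 = 0.05),
  V2 = c(0.25, 0.22, 0.08, 0.02),  # same ranking as V1
  V3 = c(0.02, 0.05, 0.20, 0.30)   # ranking reversed relative to V1
)

# cor() correlates columns, so transpose to correlate datasets
rho <- cor(t(imp), method = "spearman")

# Distance = 1 - Spearman rank correlation
d_line3 <- 1 - rho
round(d_line3, 2)
```

Because only ranks matter, V1 and V2 have a distance of 0 despite different raw scores. The raw quantity 1 - ρ runs from 0 to 2 (a fully reversed ranking, as for V1 and V3, gives 2); any rescaling to the 0-1 range is not shown in this sketch.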

Figure 4.3: VADIS distance matrix for the 3rd line of evidence in comparison of 75 hypothetical individuals across 5 hypothetical varieties.

As with Line 2, we find clearer distinctions between varieties than in Line 1, but the pattern is different from Line 2. Specifically we see suggestions of 3 main variety clusters: C and D, A and B, and E. This is not in line with the true patterns among varieties, but it is what we should expect from this line of evidence.

Recall that varieties C and D were designed to be relatively similar to one another in a very particular way, i.e. by adjusting the effect sizes of the predictors by 20%, and so the relative ranking of predictors is not likely to vary much between (speakers of) the two varieties. Thus the two varieties are in fact qualitatively quite similar in their underlying grammars, and this is captured in the low distance scores.

The situation with varieties A, B, and E is different however. For one, why are A and B so similar, when we designed them to be opposites? Again recall that varieties A and B were constructed to have predictors with identical magnitudes but opposite directions, which means that these varieties’ grammars are qualitatively very different. However, random forest predictor importance measures only assess the contribution of a predictor to the model’s overall ability to predict the outcome, and do not take into account the direction of the effects. Thus, in the random forest models for A and B, the predictors have the same relative importance in both varieties even though they have opposite effect directions.

Variety E, of course, is different from the others, as it should be.

4.2 Multidimensional scaling

4.2.1 Line 1: Statistical significance

Multidimensional scaling plots are an easy way to represent distances, and conversely similarities, visually. We’ll start with plots from the first line of evidence, which considers the similarities in the statistical significance of the predictors across varieties. A 3D MDS plot (Figure 4.4) shows that there is indeed some separation among the varieties in the ways we expect, but it is not very clean.

There is a lot of overlap between varieties A and B, and between varieties C and D. There’s also a lot of overlap among individuals, reflected by the fact that many points are plotted on top of one another.

Figure 4.4: 3D MDS plot of similarities among varieties and speakers based on Line 1 (statistical significance)

The overlapping points here illustrate one of the main limitations of this first line, which is that the distances are derived from a relatively small set of 12 binary values. Overall there is relatively little information by which we can discriminate among the 75 datasets, and even a single change from one dataset to the next can have a substantial impact on the distance measure. With larger models containing perhaps dozens of predictors, all of which could potentially be significant, this line might be of more use (e.g. as tables of binary features are used in taxonomic studies), but such models are not realistic in variationist studies, so we’ll need to triangulate these results with other lines of evidence.

Finally, in Variety E we notice a couple of outlying speakers, b and g. It’s not immediately clear why this is, though it is likely that the models of these particular datasets had some near-separation problems, which affect the parameter estimates.

4.2.2 Line 2: Effect size & direction

Turning to the second line, effect size/direction, the results are much more promising (Figure 4.5). We see clearly separated clusters representing the respective varieties. The patterns we describe above are largely borne out in the plot. But again we find the same two outliers in Variety E.

This is even easier to see in three dimensions.

Figure 4.5: 3D MDS plot of similarities among varieties and speakers based on Line 2 (effect size)

We get a very nice picture here. The varieties are all clearly separated from one another, with the exception of C and D, which cluster relatively closely together. This was of course by design, as the predictor effect sizes of variety D were set to be equivalent to those of variety C but adjusted by 20%. Varieties A and B are maximally distant from each other in the plot space, and varieties C, D, and E are separated from these along different axes. This is exactly what we wanted to see based on how we designed the predictor effect sizes.

4.2.3 Line 3: Importance ranking

Turning to the third line of evidence, we see results similar in robustness to those in Line 2, but with a slightly different pattern (Figure 4.6). There is clear separation among the varieties, similar to Line 2, but there are some key differences that are worth examining further.

Figure 4.6: 3D MDS plot of similarities among varieties and speakers based on Line 3 (importance ranking)

Rather than the 5 distinct clusters we see in Figure 4.5, we find instead 2 loose clusters: a cluster of speakers of varieties A, B and E; and a somewhat looser cluster of speakers of varieties C and D. The two clusters are particularly interesting because they cluster together for different reasons, and they illustrate one of the limitations of the third line of evidence.

Recall that varieties C and D were designed to be relatively similar to one another in a very particular way, i.e. by adjusting the effect sizes of the predictors by 20%, and so the relative ranking of predictors is not likely to vary much between (speakers of) the two varieties. Thus the two varieties are in fact qualitatively quite similar in their underlying grammars, and this is captured in the loose clustering in the MDS plot.

The situation with varieties A, B, and E is different, however. Recall that varieties A and B were constructed to have predictors with identical magnitudes but opposite directions, which means that these varieties’ grammars are qualitatively very different. However, random forest predictor importance measures only assess the contribution of a predictor to the model’s overall ability to predict the outcome, and do not take into account the direction of the effects. Thus, in the random forest models, the predictors have the same relative importance in both varieties even though they have opposite effect directions.
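The direction-blindness of a ranking-based comparison can be sketched in a few lines of base R. The effect sizes below are invented, absolute effect size is used only as a crude stand-in for random forest importance (real importances are estimated from data, not computed this way), and the rank-based distance is assumed to be 1 minus Spearman's rho.

```r
# Hypothetical effect sizes (logit scale) for two varieties;
# variety B mirrors variety A, as in the simulation design.
coef_a <- c(p1 = 1.5, p2 = -1.0, p3 = 0.75, p4 = -0.5, p5 = 0.25, p6 = 0.1)
coef_b <- -coef_a

# Stand-in for direction-blind importance: absolute effect size.
imp_a <- abs(coef_a)
imp_b <- abs(coef_b)

# Assumed rank-based distance: 1 - Spearman's rho of the importances
rank_dist <- 1 - cor(imp_a, imp_b, method = "spearman")

# A direction-sensitive comparison of the same two grammars
coef_cor <- cor(coef_a, coef_b)

rank_dist  # approximately 0: the rankings are identical
coef_cor   # approximately -1: the effects are perfectly opposed
```

The two mirrored grammars are indistinguishable on the rank-based measure and maximally opposed on the signed one, which is the pattern seen for varieties A and B.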

Furthermore, importance rankings are often heavily skewed, with many predictors showing relatively little impact, and the rank ordering of weak predictors can vary randomly between datasets. This is illustrated in Figure 4.7, which plots the mean predictor importances for each of the 5 varieties, with standard deviations.

Figure 4.7: Mean variable importance (with SD) for each variety

Note the differences in rankings among even the noise predictors. This random variability in the tails of the distributions is likely responsible for variability within varieties in Figure 4.6. As with Line 1, the results for Line 3 illustrate the potential pitfalls of relying on a single line of evidence.

4.3 Clustering methods

Clustering methods offer alternative perspectives on the patterns among varieties, and are easy to implement. For example, Figure 4.8 shows a hierarchical cluster analysis based on the Line 2 distance matrix. The analysis cleanly identifies the same 5 distinct clusters shown in the MDS plot above (Figure 4.5).

Figure 4.8: Hierarchical clustering of datasets based on VADIS Line 2

Similarly, a cluster analysis based on Line 3 also parallels the respective MDS plot, identifying two clear clusters: C & D and A, B & E. It also suggests a third cluster containing mostly datasets from Variety E. This cluster is also somewhat identifiable in the MDS plot.

Figure 4.9: Hierarchical clustering of datasets based on VADIS Line 3
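As a minimal sketch of how such an analysis can be run in base R, the block below clusters the five variety-level Line 3 distances reported in Table 5.7 of this report. The figures above are based on the 75 speaker-level datasets, so this is a smaller illustration rather than a reproduction. The linkage method (average linkage) is an assumption; the text does not say which one was used.

```r
# Variety-level Line 3 distances, copied from Table 5.7
vars <- c("A", "B", "C", "D", "E")
d3 <- matrix(0, 5, 5, dimnames = list(vars, vars))
d3[lower.tri(d3)] <- c(0.109, 0.555, 0.600, 0.118,  # A vs B, C, D, E
                       0.336, 0.400, 0.036,         # B vs C, D, E
                       0.082, 0.382,                # C vs D, E
                       0.445)                       # D vs E
d3 <- d3 + t(d3)  # fill in the upper triangle

# Agglomerative clustering; "average" linkage is an assumption
hc <- hclust(as.dist(d3), method = "average")

# Cut the tree into two groups and inspect the membership
groups <- cutree(hc, k = 2)
groups

# plot(hc) would draw the corresponding dendrogram
```

With these distances the two-group solution places A, B and E together and C and D together, in line with the two main clusters described above.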

Other clustering methods, e.g. Neighbor-Net (Bryant & Moulton 2004), are becoming popular alternatives to traditional clustering methods as well (e.g. Dunn et al. 2008; Grafmiller & Szmrecsanyi 2018).

5 Quantifying similarity

5.1 Within variety similarity

We first look at the similarity among individuals within the same variety. To estimate similarities we take the distance matrices from each line of evidence and calculate the mean pairwise distance for each individual and subtract that value from 1. This provides an average similarity score for each individual ranging from 1 (maximal similarity, i.e. identity) to 0. In the same spirit, we can take the average pairwise distance across the entire variety to obtain a similarity score reflecting the degree of internal homogeneity within a given variety. Recall that the datasets for the five varieties were created from 15 ‘individuals’ with the same variable grammars, albeit with varying baseline frequencies of the outcome (intercepts). We thus expect the VADIS lines to return internal homogeneity scores that are quite close to 1. We interpret these values as the maximal degree of similarity, or conversely the minimal grammatical distance, that the method is capable of detecting, and therefore they represent an approximate practical ceiling (or floor) against which other values may be evaluated.

For the first Line of evidence we find that the average degree of inter-speaker similarity is quite high across the five varieties (Table 5.1). This is not surprising considering the results presented above for the MDS plots. We don’t expect to see much interpretable variability here, yet we do find a consistently high degree of variety-internal consistency.

Table 5.1: Average speaker-to-speaker Similarity scores within each variety as estimated by Line 1 (statistical significance)
Variety Mean Median SD
A 0.938 0.917 0.064
B 0.897 0.917 0.065
C 0.916 0.917 0.062
D 0.884 0.917 0.074
E 0.957 1.000 0.062

We find a similar degree of homogeneity for Lines 2 (Table 5.2) and 3 (Table 5.3).

Table 5.2: Average speaker-to-speaker Similarity scores within each variety as estimated by Line 2 (effect size/direction)
Variety Mean Median SD
A 0.857 0.864 0.040
B 0.851 0.859 0.043
C 0.866 0.872 0.036
D 0.883 0.891 0.027
E 0.882 0.880 0.022

Table 5.3: Average speaker-to-speaker Similarity scores within each variety as estimated by Line 3 (importance ranking)
Variety Mean Median SD
A 0.931 0.927 0.034
B 0.928 0.936 0.045
C 0.892 0.900 0.063
D 0.910 0.927 0.063
E 0.961 0.964 0.021

Overall we find that the average within-variety grammatical similarity between individual speakers, who vary only with respect to their baseline frequency, is around .9 for lines 1 and 3, and a bit lower, around .85 to .9 for line 2. We can combine the three lines to derive an overall similarity score representing the internal homogeneity for each variety.

Table 5.4: Speaker-to-speaker Similarity scores within each variety averaged across the three lines.
Variety Mean Median
A 0.909 0.917
B 0.892 0.917
C 0.891 0.900
D 0.892 0.917
E 0.933 0.964

These datasets were constructed to simulate the minimal amount of grammatical variation that we could realistically expect to find within a homogeneous community of speakers. We note that even this simulation is an overly conservative one—it’s very unlikely that any true community of users ever exhibits the degree of homogeneity simulated here. These findings suggest that pairwise similarity scores of .85 or above represent cases of grammars that are indistinguishable from one another in any meaningful sense.

5.2 Cross-variety similarity

We now turn to the case of cross-variety comparisons of the 5 “community” varieties, examining the different lines of evidence in turn. This scenario is more representative of work in standard Comparative Sociolinguistics. We reiterate the planned differences among the 5 varieties here:

Variety A. A baseline grammar with a reasonable range of predictor weightings. Weightings for all varieties are treated as adjustments on the logit scale, thus are comparable to coefficients in a logistic regression model.

Variety B. A mirror image of Variety A. Predictors have exact same effect size, but in the opposite direction. That is, weightings have equivalent absolute values but opposite signs.

Variety C. A third variety with predictor weightings that are very different from both Varieties A and B, yet are still within a reasonable range.

Variety D. A variety with predictor weightings equivalent to those of Variety C, yet randomly increased or decreased by 20%. For example, if predictor P has an effect size of 1 in Variety C, it would have a value of either 1.2 or 0.8 in Variety D.

Variety E. A ‘Frankenstein’ variety composed of 2 weightings taken from each of the other 4 varieties (2 from A, 2 from B, 2 from C, and 2 from D).

As with the visualizations, we expect to find the following basic patterns:

  • Varieties A and B should be quite dissimilar, as their grammars are polar opposites.
  • Varieties C and D should be quite similar to each other, due to the relatively small differences in their constraints, yet both should be distant from A and B.
  • Variety E should be different still from the others, as it was created from a combination of predictors from each variety.
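These expectations follow directly from how the weightings are built, which can be sketched as below. All weightings are invented for illustration (the actual simulation values are not reproduced here), eight predictors are assumed, and plain Euclidean distance between weight vectors is used as an assumed stand-in for a grammatical distance.

```r
set.seed(1)

# Hypothetical logit-scale weightings for 8 predictors
w_a <- c(1.0, -0.8, 0.6, -0.4, 0.3, -0.2, 0.1, 0.5)   # Variety A: baseline
w_b <- -w_a                                           # Variety B: mirror image of A
w_c <- c(-0.3, 0.9, 0.2, 1.1, -0.7, 0.4, -0.9, 0.1)   # Variety C: unrelated to A and B
# Variety D: each weighting of C randomly increased or decreased by 20%
w_d <- w_c * sample(c(0.8, 1.2), length(w_c), replace = TRUE)
# Variety E: two weightings taken from each of the other four varieties
w_e <- c(w_a[1:2], w_b[3:4], w_c[5:6], w_d[7:8])

W <- rbind(A = w_a, B = w_b, C = w_c, D = w_d, E = w_e)

# Assumed distance measure: Euclidean distance between weight vectors
D <- as.matrix(dist(W))
round(D, 2)

# Mirroring doubles the vector, so d(A, B) = 2 * ||A||;
# a +/-20% adjustment gives d(C, D) = 0.2 * ||C||
norm_a <- sqrt(sum(w_a^2))
norm_c <- sqrt(sum(w_c^2))
c(AB = D["A", "B"], twice_norm_A = 2 * norm_a)
c(CD = D["C", "D"], fifth_norm_C = 0.2 * norm_c)
```

Under these assumptions the A-B distance is fixed at twice the length of A's weight vector, while the C-D distance is only a fifth of the length of C's, whichever predictors happen to be increased or decreased.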

To compare varieties at the level of the community, we aggregated over the data from the 15 individuals in each variety, and fit a single model per variety. To create realistic datasets, we down-sampled the data by randomly sampling 150 tokens from each speaker in each variety to give us 5 datasets of 2250 observations (150 tokens x 15 speakers) each. We account for speaker variability by including a by-speaker intercept in our model structure.
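The down-sampling step can be sketched in base R as follows. The data frame, its column names, and the size of the full per-speaker datasets (1000 tokens) are hypothetical; only the figures of 15 speakers and 150 sampled tokens per speaker come from the text.

```r
set.seed(2019)

n_speakers <- 15
tokens_per_speaker <- 1000  # hypothetical size of each full speaker dataset

# Hypothetical full dataset for one variety
full <- data.frame(
  speaker = rep(sprintf("spk%02d", seq_len(n_speakers)),
                each = tokens_per_speaker),
  outcome = rbinom(n_speakers * tokens_per_speaker, size = 1, prob = 0.5)
)

# Randomly sample 150 row indices within each speaker
rows_by_speaker <- split(seq_len(nrow(full)), full$speaker)
keep <- unlist(lapply(rows_by_speaker, function(idx) sample(idx, 150)))
down <- full[keep, ]

nrow(down)           # 2250 = 150 tokens x 15 speakers
table(down$speaker)  # 150 tokens for every speaker

# A by-speaker intercept would then be written as a term such as
# (1 | speaker) in lme4/brms-style model formulas.
```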

The pairwise distance matrix for Line 1 is shown in Table 5.5. Recall that this line considers whether the same predictors are significant across varieties, and smaller distances reflect greater cross-variety consistency in the predictor significance. In terms of the relative degree of similarity among varieties, we find that varieties C and D are identical to one another, and closer to A and B than to E. A and B are relatively dissimilar to one another, and A and E represent the two most dissimilar varieties.

Table 5.5: Pairwise variety distance matrix as estimated by Line 1 (statistical significance)
     A      B      C      D      E
A    0
B    0.167  0
C    0.083  0.083  0
D    0.083  0.083  0      0
E    0.333  0.167  0.25   0.25   0

From Table 5.5 we calculate the similarity scores by taking the average pairwise distance for each variety and subtracting it from 1 (Table 5.6). Based on this table we can see that Variety E is on average the most distinct variety according to the first line of evidence.

Table 5.6: Average Similarity scores across varieties as estimated by Line 1 (statistical significance)
Variety Mean Similarity
A 0.833
B 0.875
C 0.896
D 0.896
E 0.750
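The calculation behind Table 5.6 can be checked with a short base R sketch. The distances are copied from Table 5.5; the helper name `mean_similarity` is invented here and is not a function of the {VADIS} package.

```r
# Line 1 distances between the five varieties, copied from Table 5.5
vars <- c("A", "B", "C", "D", "E")
d1 <- matrix(0, 5, 5, dimnames = list(vars, vars))
d1[lower.tri(d1)] <- c(0.167, 0.083, 0.083, 0.333,  # A vs B, C, D, E
                       0.083, 0.083, 0.167,         # B vs C, D, E
                       0.000, 0.250,                # C vs D, E
                       0.250)                       # D vs E
d1 <- d1 + t(d1)  # fill in the upper triangle

# Mean similarity = 1 - average distance to the other varieties.
# The diagonal is 0, so dividing the row sum by (n - 1) averages
# over the other varieties only.
mean_similarity <- function(d) 1 - rowSums(d) / (nrow(d) - 1)

sim1 <- mean_similarity(d1)
sim1

# Values reported in Table 5.6, for comparison
reported <- c(A = 0.833, B = 0.875, C = 0.896, D = 0.896, E = 0.750)
max(abs(sim1 - reported))  # differences are of rounding size
```

The same helper applies unchanged to the distance matrices of the other lines of evidence.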

We next turn to Line 3, which considers correlations among the predictor importance rankings across the varieties (Table 5.7). The first thing to note is that the absolute distances are larger for this line than those derived for Line 1. This is due to the greater amount of information in the table of rankings from which the distances are calculated.

Table 5.7: Pairwise variety distance matrix as estimated by Line 3 (predictor importance)
     A      B      C      D      E
A    0
B    0.109  0
C    0.555  0.336  0
D    0.6    0.4    0.082  0
E    0.118  0.036  0.382  0.445  0

Curiously, we find that Variety E is not the most dissimilar variety, but rather D is. Overall, the scores are generally lower than for Line 1, but this is to be expected given the greater variability in the rankings.

Table 5.8: Average Similarity scores across varieties as estimated by Line 3 (predictor importance)
Variety Mean Similarity
A 0.655
B 0.780
C 0.661
D 0.618
E 0.755

Now consider Line 2, which considers correlations among the predictor effect sizes, measured as the regression model coefficients (Table 5.9). The first thing to note is that the distance scores for some comparisons are greater than 1, which leads to negative similarity scores as seen in Table 5.10.

Table 5.9: Pairwise variety distance matrix as estimated by Line 2 (effect size/direction)
     A      B      C      D      E
A    0
B    2.119  0
C    1.482  1.893  0
D    1.421  1.815  0.127  0
E    1.235  2.255  1.039  0.973  0

Table 5.10: Average Similarity scores across varieties as estimated by Line 2 (effect size/direction)
Variety Mean Similarity
A -0.564
B -1.021
C -0.135
D -0.084
E -0.375

The reason for this is that the ‘maximal reasonable distance’ weighting applied to the distance matrix during normalization (Szmrecsanyi, Grafmiller & Rosseel 2019:5) is actually smaller than the maximum distances in the simulated data. In other words, the simulated varieties here are in fact much more dissimilar to one another than we’d ever theoretically expect to see in real data from closely related dialects.⁵

To make sense of our simulated data we then have two options. The first is to simply interpret the similarity scores as is. Since similarity is measured as 1 minus the (average) distance, we can interpret the negative values along the same continuum: more negative values reflect lower average similarity, positive values indicate greater average similarity, and 0 is not necessarily meaningful. Based on this we can see that variety B is the least similar to the others, while D and C are the most similar.

Alternatively, the distance matrix weighting can be increased to keep the distance values between 0 and 1 (in accord with Lines 1 and 3). Table 5.11 shows the distance matrix with a weighting of 3, and Table 5.12 the corresponding similarity scores. Note that the same relative trends hold between the similarity scores in Table 5.10 and Table 5.12: B is still the least similar and D the most similar; only the absolute values have changed.

brm_line2b <- VADIS::vadis_line2(brm_list, path = FALSE, weight = 3)

Table 5.11: Pairwise variety distance matrix as estimated by Line 2 (effect size/direction). Weighting adjusted to 3.
     A      B      C      D      E
A    0
B    0.706  0
C    0.494  0.631  0
D    0.474  0.605  0.042  0
E    0.412  0.752  0.346  0.324  0

Table 5.12: Average Similarity scores across varieties as estimated by Line 2 (effect size/direction). Weighting adjusted to 3.
Variety Mean Similarity
A 0.479
B 0.326
C 0.622
D 0.639
E 0.542
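Why the relative trends survive the change of weighting can be seen from the tables themselves: the distances in Table 5.11 equal those in Table 5.9 divided by 3, up to rounding. The sketch below treats the weighting as a simple divisor, which is an inference from the two tables rather than a statement about how the {VADIS} package implements it.

```r
# Line 2 distances with the original weighting, copied from Table 5.9
vars <- c("A", "B", "C", "D", "E")
d2 <- matrix(0, 5, 5, dimnames = list(vars, vars))
d2[lower.tri(d2)] <- c(2.119, 1.482, 1.421, 1.235,  # A vs B, C, D, E
                       1.893, 1.815, 2.255,         # B vs C, D, E
                       0.127, 1.039,                # C vs D, E
                       0.973)                       # D vs E
d2 <- d2 + t(d2)

# Assumed effect of weight = 3: divide every distance by 3
d2_w3 <- d2 / 3

# Compare with the values printed in Table 5.11
table_5_11 <- c(0.706, 0.494, 0.474, 0.412,
                0.631, 0.605, 0.752,
                0.042, 0.346,
                0.324)
max(abs(d2_w3[lower.tri(d2_w3)] - table_5_11))  # rounding-sized differences

# Mean similarity = 1 - average distance to the other four varieties
sim_w1 <- 1 - rowSums(d2) / 4
sim_w3 <- 1 - rowSums(d2_w3) / 4

# Dividing by a positive constant rescales the scores but keeps their order
identical(order(sim_w1), order(sim_w3))
round(rbind(weight_1 = sim_w1, weight_3 = sim_w3), 3)
```

Because the rescaling is the same for every pair, it changes the absolute similarity values but cannot change which variety is the most or least similar on average.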

The differences here highlight the sensitivity of these scores to additional degrees of freedom in the modeling process, which we are currently working to reduce.

Finally, we turn to the combined evidence from all three lines (Table 5.13). Here we can see that as planned, varieties C and D are extremely similar to one another, while the other varieties are much more distant.

Table 5.13: Fused pairwise variety distance matrix as estimated by all 3 Lines of evidence combined. Weighting for Line 2 = 1.
     A      B      C      D      E
A    0
B    0.541  0
C    0.611  0.55   0
D    0.627  0.574  0.064  0
E    0.582  0.52   0.616  0.641  0

Table 5.14: Average Similarity scores across varieties as estimated by all 3 Lines of evidence combined.
Variety Mean Similarity
A 0.40975
B 0.45375
C 0.53975
D 0.52350
E 0.41025

The average similarity scores are largely even, with C and D again being slightly more similar on average, though their slight bump here is due almost entirely to their high degree of similarity with one another. Based on the conclusions from the within variety scores in Section 5.1, we would be justified in concluding that based on their pairwise distance of .06 (Similarity = 1 - .06 = .94), the variable grammars of Varieties C and D are not meaningfully different from one another.

At the same time, it is not yet clear how to interpret values below .85 or so. The distances obtained through the combined lines of evidence in Table 5.13 suggest that distances of around .5 to .6 reflect a very low degree of similarity; however, this is only applicable to the combined lines. Distances from the individual lines can vary considerably in their absolute values, and we therefore tentatively suggest that comparing scores is only meaningful based on the combined lines (if at all).

5.3 Stepwise similarity

To be continued…

6 Representing uncertainty

The final issue we want to address is the inherent uncertainty in the models used to estimate the variable grammars, and how such uncertainty might be represented in the VADIS output. The standard VADIS method uses regression models for Lines 1 and 2, and random forest models for Line 3. As we’ve seen, the usefulness of Line 1 for visualizing distances is quite limited, and is obviously correlated with Line 2, so we will not focus on it here. With Line 2, the outputs of standard regression model tools already provide measures of uncertainty (confidence), e.g. as standard errors for the coefficient estimates. The challenge is how to incorporate that information into our representations. With Line 3, however, it’s not obvious how to derive uncertainty estimates from the random forest variable importance rankings, except perhaps through bootstrapping or cross-validation, and so we set this aside for further investigation.

An advantage of using Bayesian regression models is that we can straightforwardly sample sets of coefficients from the posterior distributions and use these to represent the variability in the model estimates.⁶ To do this, we randomly sample sets of coefficient estimates directly from the posterior distributions, and for each sample compute a distance matrix and a unique set of MDS coordinates mapping out our ‘varieties’ in two- or three-dimensional space. These coordinates can then simply be overlaid onto one another in a single graphic representation to create ‘variety clouds’ (or ellipses), whose density and degree of overlap reflect the degree of uncertainty in the relative similarities among the varieties.
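The procedure can be sketched in base R under some loud assumptions: the posterior draws are simulated as independent normal noise around invented posterior means (a stand-in for draws taken from fitted Bayesian models), the distance between varieties is plain Euclidean distance between coefficient vectors, and all names below are hypothetical.

```r
set.seed(123)

# Hypothetical posterior means: rows = varieties, columns = predictors
post_mean <- rbind(
  A = c( 1.00, -0.80,  0.60, -0.40),
  B = c(-1.00,  0.80, -0.60,  0.40),
  C = c(-0.30,  0.90,  0.20,  1.10),
  D = c(-0.36,  0.72,  0.24,  0.88),
  E = c( 1.00, -0.80,  0.20,  0.88)
)
post_sd <- 0.1   # hypothetical posterior SD, the same for every coefficient
n_draws <- 200

# For one simulated draw: perturb the coefficients, compute the
# distance matrix, and map the varieties into two dimensions.
one_draw <- function(i) {
  draw <- post_mean + rnorm(length(post_mean), mean = 0, sd = post_sd)
  xy <- cmdscale(dist(draw), k = 2)
  data.frame(draw = i, variety = rownames(draw),
             dim1 = xy[, 1], dim2 = xy[, 2], row.names = NULL)
}

# Stack the coordinates from all draws into one table of 'cloud' points
clouds <- do.call(rbind, lapply(seq_len(n_draws), one_draw))
head(clouds)

# Overlaying the points gives one cloud per variety, e.g.:
# plot(clouds$dim1, clouds$dim2, pch = 16,
#      col = as.integer(factor(clouds$variety)))
```

One caveat for this sketch: an MDS solution is only unique up to rotation and reflection, so coordinates computed separately for each draw may need to be aligned to a common reference configuration before they are overlaid.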

To test this we use the same down-sampled variety datasets used in Section 5.2 above, and run the VADIS analysis on them. Figures 6.1 and 6.2 show MDS plots of Line 2 distance matrices calculated from 200 sets of coefficients which were randomly sampled from the posterior distributions of the five variety models.

Figure 6.1: MDS plots of similarities among varieties based on 200 samples for Line 2 (effect size). Points represent individual samples from the GLMM posterior distribution, with ellipses representing 50% (solid) and 90% (dashed) multivariate normal distributions.

Figure 6.2: 3D MDS plot of similarities among varieties based on 200 samples for Line 2 (effect size). Points represent individual samples from the GLMM posterior distribution.

The tight clusters show that the coordinates from the MDS plots are highly stable, and similarities among the simulated varieties derived from the model statistics are very robust. This provides a clearly successful proof of concept for the VADIS method and the usefulness of the visualizations it can produce. Of course, real data are never so neat, and considerable work remains to test to what extent the method can be successfully applied to actual case studies.

In addition to the MDS plots we can look at the distributions of average similarity scores across these samples. We find that as in the MDS plot, in this case variety B consistently shows the lowest average similarity, while varieties C and D are considerably more similar to one another on average.

Figure 6.3: Distributions of the mean similarity scores across the 5 varieties for 200 sets of coefficients randomly sampled from the GLMM posteriors. Higher values reflect greater average grammatical similarity of the variety with all the others.

The findings above largely corroborate those of the MDS plots.

7 Conclusion

This simulation study was motivated by three research questions regarding the validity and reliability of the new VADIS method:

RQ 1. How valid are the measures and representations of grammatical distances? In other words, do the measures of similarity obtained via the VADIS method accurately reflect the true degree of similarity among grammars?

RQ 2. Can we determine a reasonable range of small, medium, or large degrees of grammatical similarity?

RQ 3. Can we capture the uncertainty in our underlying grammatical models in the VADIS output?

In answer to the first question, we find that the method is capable of accurately capturing genuine grammatical differences, partly through the similarity scores but mainly through the visual representations in the MDS plots. While the method does not provide tests of statistical significance, it can provide a useful way of exploring grammatical similarity among varieties in a more holistic fashion. Regarding the second question, we find that distances below roughly .1 to .15 (Similarity scores of roughly .85 to .9 or above; see Section 5.1) represent practically indistinguishable variety grammars. We can also make a tentative case for scores of .5 as representing a very low degree of similarity, though we caution that this applies only to scores based on the combined lines of evidence. Lastly, we show that the inherent uncertainty in the underlying predictive models can in principle be represented in the visualizations and similarity scores via MDS ‘variety clouds’ and the distributions of similarity scores.

In sum, the simulation study here demonstrates the potential value of the VADIS method as a unique and powerful tool for interpretation in comparative variationist analysis. We find that, at least in principle, the method offers a reliable approach to exploring grammatical similarity among different varieties’ linguistic variables. We also show that the method can easily scale up to compare dozens of varieties (dialects, individuals, …) at a time. Further testing and application in real-world contexts is surely needed. While the study here considers only similarity with respect to a single variable, it is possible, and desirable, that the method could be expanded or adapted to explore lectal coherence across multiple linguistic variables at a time (see e.g. Szmrecsanyi, Grafmiller & Rosseel 2019), and this is an avenue that we are actively pursuing.

References

Altmann, André, Laura Toloşi, Oliver Sander & Thomas Lengauer. 2010. Permutation importance: A corrected feature importance measure. Bioinformatics 26(10). 1340–1347. doi:10/cm7h6d.

Blaxter, Tam, Kate Beeching, Richard Coates, James Murphy & Emily Robinson. 2019. Each p[ɚ]son does it th[ɛː] way: Rhoticity variation and the community grammar. Language Variation and Change 31(1). Cambridge University Press. 91–117. doi:10.1017/S0954394519000048.

Bryant, David & Vincent Moulton. 2004. Neighbor-Net: An agglomerative method for the construction of phylogenetic networks. Molecular Biology and Evolution 21(2). 255–265. doi:10.1093/molbev/msh018.

Bürkner, Paul-Christian. 2017. brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software 80(1). doi:10/gddxwp.

Cedergren, Henrietta J. & David Sankoff. 1974. Variable Rules: Performance as a Statistical Reflection of Competence. Language 50(2). 333. doi:10.2307/412441.

Childs, Claire. 2017. Variation and change in English negation: A cross-dialectal perspective. Newcastle: Newcastle University Ph.D. Thesis.

Claes, Jeroen. 2016. Cognitive, social, and individual constraints on linguistic variation: A case study of presentational ‘haber’ pluralization in Caribbean Spanish. (Cognitive Linguistics Research volume 60). Berlin; Boston: De Gruyter Mouton.

Cohen, Jacob. 1992. A power primer. Psychological Bulletin 112(1). 155–159. doi:10.1037/0033-2909.112.1.155.

Dunn, Michael, Stephen C. Levinson, Eva Lindström, Ger Reesink & Angela Terrill. 2008. Structural phylogeny in historical linguistics: Methodological explorations applied in Island Melanesia. Language 84(4). 710–759. doi:10/dwd463.

Gelman, Andrew, Aleks Jakulin, Maria Grazia Pittau & Yu-Sung Su. 2008. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics 2(4). 1360–1383. doi:10.1214/08-AOAS191.

Ghosh, Joyee, Yingbo Li & Robin Mitra. 2018. On the use of Cauchy prior distributions for Bayesian logistic regression. Bayesian Analysis 13(2). 359–383. doi:10/gdfv6g.

Grafmiller, Jason & Benedikt Szmrecsanyi. 2018. Mapping out particle placement in Englishes around the world: A study in comparative sociolinguistic analysis. Language Variation and Change 30(3). 385–412. doi:10/gf4p2w.

Gries, Stefan Th. 2019. On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement. Corpus Linguistics and Linguistic Theory 0(0). doi:10.1515/cllt-2018-0078.

Makowski, Dominique, Mattan S. Ben-Shachar, S. H. Annabel Chen & Daniel Lüdecke. 2019. Indices of effect existence and significance in the Bayesian framework. Frontiers in Psychology 10. Frontiers. doi:10/ggfw2j.

Nicodemus, Kristin K, James D Malley, Carolin Strobl & Andreas Ziegler. 2010. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics 11(1). 110. doi:10.1186/1471-2105-11-110.

Ortin, Ramses & Carmen Fernandez-Florez. 2019. Transfer of variable grammars in third language acquisition. International Journal of Multilingualism 16(4). Routledge. 442–458. doi:10.1080/14790718.2018.1550088.

Poplack, Shana & Sali Tagliamonte. 2001. African American English in the diaspora. (Language in Society 30). Malden, MA: Blackwell.

Sankoff, David, Sali Tagliamonte & Eric Smith. 2015. Goldvarb Yosemite: A variable rule application for Macintosh. Toronto: Department of Linguistics, University of Toronto.

Szmrecsanyi, Benedikt, Jason Grafmiller & Laura Rosseel. 2019. Variation-Based Distance and Similarity Modeling: A Case Study in World Englishes. Frontiers in Artificial Intelligence 2. doi:10.3389/frai.2019.00023.

Tagliamonte, Sali. 2012. Variationist Sociolinguistics: Change, Observation, Interpretation. (Language in Society 40). Malden, MA: Wiley-Blackwell.

Tagliamonte, Sali. 2013. Comparative Sociolinguistics. In J. K. Chambers & Natalie Schilling (eds.), Handbook of Language Variation and Change, 2nd edn., 130–156. Chichester, West Sussex, United Kingdom: John Wiley & Sons Inc.

Wright, Marvin N. & Andreas Ziegler. 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77(1). doi:10.18637/jss.v077.i01.


  1. Mixed-effects methods for random forests are not fully functional at the moment, so we cannot model the kinds of multilevel variation with random forests that we can with mixed-effects regression models. Therefore the random forest model structure is usually equivalent to the fixed-effects structure of the corresponding regression model.

  2. The {VADIS} package provides several other methods for assessing “significance” in Bayesian models, such as the probability of direction, Region of Practical Equivalence (ROPE), and Maximum A Posteriori (MAP) based p-values (Makowski et al. 2019).

  3. We use the {ranger} package mainly for its extremely fast implementation, and we think this is reasonable given that the simulated datasets were designed in a way to minimize multicollinearity, which can impact the reliability of importance measures. Other random forest methods, e.g. those implemented in the {party} package, are arguably more appropriate due to the well-known issues of multicollinearity in natural language data, but this remains an area of some contention (Gries 2019). Practically speaking, however, the very large computational cost of calculating conditional variable importance scores with {party} becomes prohibitive as the number of comparisons increases beyond even a few datasets (we fit 75 models here). At the moment, further testing of different methods on real language data is needed to identify the strengths and weaknesses of different random forest methods for variationist studies.

  4. This is effectively a modern adaptation of the notion of the ‘constraint hierarchy’ in traditional Comparative Sociolinguistics methods (Tagliamonte 2013), which developed out of the longstanding VARBRUL tradition (Cedergren & Sankoff 1974; Sankoff, Tagliamonte & Smith 2015).

  5. It remains to be seen whether this hypothesis turns out to be true. For now at least, we are unaware of real world cases where such extreme differences can be found.

  6. We are experimenting with methods for working with non-Bayesian models, e.g. by sampling from simulated parameter distributions based on the coefficient point estimates and their standard errors, but this has not been implemented yet.